Here are the methods copied from the lecture.
gd2 <- function(fgh, x0, alpha=1e-2, tol=1e-6, n.max=1e4) {
  # Gradient ascent with a fixed step size: x <- x + alpha * gradient.
  # fgh(x) must return a list whose first two components are f(x) and the gradient.
  x <- x0
  grad <- fgh(x)[[2]]
  n <- 0
  while ((max(abs(alpha*grad)) > tol) & (n < n.max)) {
    x <- x + alpha*grad
    grad <- fgh(x)[[2]]
    n <- n + 1
  }
  if (n < n.max) {
    return(list("f"=fgh(x)[[1]], "x"=x, "Number of steps"=n))
  } else {
    return(NULL)  # failed to converge within n.max steps
  }
}
line.search2 <- function(fg, x0, alpha=1e-2, tol=1e-8, n.max=1e4) {
  # Gradient ascent with a line search: at each step, choose the step size
  # alpha in (0, 0.1) that maximises f along the current gradient direction.
  g <- function(alpha, x, grad) fg(x + alpha*grad)[[1]]
  x <- x0
  grad <- fg(x)[[2]]
  alpha <- optimize(g, c(0, .1), x=x, grad=grad, maximum=TRUE)$maximum
  n <- 0
  while ((max(abs(alpha*grad)) > tol) & (n < n.max)) {
    x <- x + alpha*grad
    grad <- fg(x)[[2]]
    alpha <- optimize(g, c(0, .1), x=x, grad=grad, maximum=TRUE)$maximum
    n <- n + 1
  }
  if (n < n.max) {
    return(list("f"=fg(x)[[1]], "x"=x, "Number of steps"=n))
  } else {
    return(NULL)  # failed to converge within n.max steps
  }
}
newton <- function(fgh, x0, tol=1e-8, n.max = 100) {
  # Newton's method: x <- x - H^{-1} grad. It converges to a stationary point;
  # whether that is a maximum, minimum, or saddle depends on the Hessian there.
  # fgh(x) must return a list of f(x), the gradient, and the Hessian.
  x <- x0
  fx <- fgh(x)  # one call per iteration gives all three components
  grad <- fx[[2]]
  H <- fx[[3]]
  n <- 0
  while ((max(abs(grad)) > tol) & (n < n.max)) {
    x <- x - solve(H, grad)
    fx <- fgh(x)
    grad <- fx[[2]]
    H <- fx[[3]]
    n <- n + 1
  }
  if (n < n.max) {
    return(list("f"=fx[[1]], "x"=x, "Number of steps"=n))
  } else {
    return(NULL)  # failed to converge within n.max steps
  }
}
gsection <- function(f, x.l, x.r, x.m, tol=1e-8) {
  # Golden-section search for a maximum of f over [x.l, x.r], starting from an
  # interior point x.m with f(x.m) greater than f at both endpoints.
  f.l <- f(x.l)[1]
  f.r <- f(x.r)[1]
  f.m <- f(x.m)[1]
  gr <- (1 + sqrt(5))/2  # the golden ratio
  n <- 0
  while ((x.r - x.l) > tol) {
    n <- n + 1
    if ((x.r - x.m) > (x.m - x.l)) {
      # right subinterval is larger: probe it
      y <- x.m + (x.m - x.l)/gr
      f.y <- f(y)[1]
      if (f.y >= f.m) {
        x.l <- x.m
        f.l <- f.m
        x.m <- y
        f.m <- f.y
      } else {
        x.r <- y
        f.r <- f.y
      }
    } else {
      # left subinterval is larger: probe it
      y <- x.m - (x.r - x.m)/gr
      f.y <- f(y)[1]
      if (f.y >= f.m) {
        x.r <- x.m
        f.r <- f.m
        x.m <- y
        f.m <- f.y
      } else {
        x.l <- y
        f.l <- f.y
      }
    }
  }
  return(list("f"=f(x.m)[1], "x.m"=x.m, "Number of steps"=n))
}
Use the golden-section search algorithm to find all of the local maxima of the function \[ f(x) = \begin{cases} 0 & \text{if } x=0\\ |x|\log(|x|/2)e^{-|x|} & \text{if } x\neq0 \end{cases} \] over the interval \([-10,10]\).
Plot the function first using curve, which helps you choose initial points. Try to use ifelse to define the function.
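As a sketch of how this exercise might be set up (the bracket values below are assumptions read off the plot, and the final optimize() call is only a base-R cross-check of the bracket; the exercise itself should use gsection):

```r
# the piecewise function, vectorised with ifelse (the yes-branch handles x = 0)
f <- function(x) ifelse(x == 0, 0, abs(x) * log(abs(x)/2) * exp(-abs(x)))
curve(f, from = -10, to = 10, n = 1001)  # suggests maxima near 0 and +/- 3
# gsection(f, 2, 6, 3) would then bracket the positive maximum;
# a base-R cross-check of the same bracket:
opt <- optimize(f, c(2, 6), maximum = TRUE)
```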
Using \(f(x,y)=x(x-1)(x+1) - y(y-1)(y+1)\), compare gradient descent, line search, and Newton's method in terms of the number of steps required to reach a local maximum in the region \([-1,0.5]\times[-0.5,1]\). Use \(x_0 = (-.1,.1)\).
library(akima)   # provides interp(); the 'interp' package is a drop-in alternative
library(plotly)  # provides plot_ly() and add_surface()
X <- seq(-1, 1, length=100)
Y <- seq(-1, 1, length=100)
XY <- expand.grid(X=X, Y=Y) # create a grid
fn <- function(x, y) x*(x-1)*(x+1) - y*(y-1)*(y+1)
Z <- fn(XY$X, XY$Y)
s <- interp(x=XY$X, y=XY$Y, z=Z) # interpolate onto the regular grid plot_ly expects
plot_ly(x=s$x, y=s$y, z=t(s$z)) %>% add_surface()
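The solvers above expect a function returning the value, gradient, and Hessian as a list. A minimal sketch for this \(f\) (the analytic gradient, Hessian, and maximiser below are easy to verify by hand):

```r
# f(x, y) = x^3 - x - y^3 + y, packaged as the (value, gradient, Hessian) list
# that gd2() and newton() expect (line.search2 uses only the first two components)
fgh <- function(x) {
  f <- x[1]^3 - x[1] - x[2]^3 + x[2]
  grad <- c(3*x[1]^2 - 1, 1 - 3*x[2]^2)
  H <- diag(c(6*x[1], -6*x[2]))
  list(f, grad, H)
}
# the local maximum in the region is at (-1/sqrt(3), 1/sqrt(3)), where the
# gradient vanishes and the Hessian is negative definite
x.star <- c(-1, 1)/sqrt(3)
```

One could then compare, e.g., newton(fgh, c(-.1, .1)) against gd2(fgh, c(-.1, .1)) by their "Number of steps" components.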
In practice, especially for high-dimensional problems, we often have no useful visual information and no idea where to start. In these cases, a common approach is to use random initial points.
Using x0 <- c(runif(1,-.5,3), runif(1,-.5,2)) to generate a random initial point over \([-0.5, 3]\times[-0.5, 2]\), try to find as many local maxima of the following function as possible: \[ f_2(x,y) =\sin\left(\frac{x^2}{2} - \frac{y^2}{4}\right)\cos\left(2x-e^y\right) .\] Repeat this many times (e.g., 1000 times) and keep a record of the local maxima found. Which local maxima do you find?
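One possible sketch of the random-restart loop, using base R's optim() with fnscale = -1 to maximise in place of the lecture's own solvers (the restart count and the rounding used for deduplication are illustrative choices):

```r
f2 <- function(x) sin(x[1]^2/2 - x[2]^2/4) * cos(2*x[1] - exp(x[2]))
set.seed(1)
maxima <- NULL
for (i in 1:200) {
  x0 <- c(runif(1, -.5, 3), runif(1, -.5, 2))  # random initial point
  res <- tryCatch(optim(x0, f2, method = "BFGS", control = list(fnscale = -1)),
                  error = function(e) NULL)    # a run may wander and fail
  if (!is.null(res) && res$convergence == 0)   # keep only converged runs
    maxima <- unique(rbind(maxima, round(res$par, 2)))
}
# maxima now records the distinct stationary points reached from the 200 starts
```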
Computation of the gradient is often a non-trivial task in modern optimisation problems.
Recall the gradient descent algorithm for minimising an objective function \(f:\mathbb{R}^d\to\mathbb{R}\): \[ x_{n+1} = x_n - \alpha \nabla f(x_n).\] To implement the algorithm, we need to compute the gradient \(\nabla f(x_n)\) every time we take a step \(\alpha \nabla f(x_n)\) from \(x_n\) to \(x_{n+1}\).
Mathematically, the gradient is simply the vector of partial derivatives \(\frac{\partial f}{\partial x_i}(x_n)\) for \(i\in\{1,\dots,d\}\). As long as we have these partial-derivative functions implemented in R for all \(i\), we could just call each of the \(d\) functions at every step \(n\in\{1,2,\dots,N\}\) and reach a local minimum after \(N\) steps. In practice, this naïve idea can easily become infeasible.
First, the required number of steps \(N\) depends on the model, the step size \(\alpha\), and the quality of the initial point \(x_0\). Even in simple 2-D problems \((d=2)\), \(N\) can be in the thousands.
Second, in modern optimisation problems, it is often the case that \(d\) is quite large, and \(\nabla f\) comprises \(d\) complicated functions. Even a single evaluation of such a \(\nabla f\) may take non-trivial time.
To get a sense of large-scale problems, repeat a simple operation, sum <- sum + (-1)^i, a billion times (i in 1:1e9).
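A scaled-down sketch of this timing experiment (10^6 instead of 10^9 iterations, so it runs in about a second; extrapolate by a factor of 1000):

```r
s <- 0
elapsed <- system.time(
  for (i in 1:1e6) s <- s + (-1)^i
)["elapsed"]
# even this 1000x-smaller loop takes appreciable time in pure R;
# the alternating sum over an even number of terms is exactly 0
```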
In statistics and machine learning, optimisation problems often arise from estimation of model parameters. For example, in deep learning, it is common to have billions of parameters \((d>10^9)\) that define a deep neural network. Estimation of those parameters takes days or even weeks.
In a regression problem, we often have the following minimisation problem: \[\min_{w\in\mathbb{R}^d} \mathcal{C}(w) := \min_{w\in\mathbb{R}^d} \sum_{j=1}^m(y_j-f(x_j;w))^2,\] where \(f\) is a statistical model parameterised by \(w\), and \(\{(x_j,y_j)\}_{j=1}^m\) is data to which we fit \(f\) by solving the above minimisation problem.
To use the gradient descent \[ w_{n+1} = w_n - \alpha \nabla \mathcal{C}(w_n),\] we need to compute \(\nabla \mathcal{C}(w_n)\) at every step \(n\in\{1,2,\dots,N\}\).
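For the linear model \(f(x_j;w) = x_j^\top w\), the gradient has the closed form \(\nabla\mathcal{C}(w) = -2\sum_j x_j(y_j - x_j^\top w)\), and the descent loop is only a few lines. A sketch on simulated data (the sample size, step size, and iteration count are illustrative assumptions):

```r
set.seed(42)
X <- cbind(1, rnorm(50))                 # design matrix: intercept + one covariate
w.true <- c(2, -3)
y <- drop(X %*% w.true) + rnorm(50, sd = 0.1)  # simulated responses
w <- c(0, 0)
alpha <- 1e-3
for (n in 1:5000) {
  grad <- -2 * drop(t(X) %*% (y - X %*% w))  # gradient of the squared-error sum
  w <- w - alpha * grad
}
# w is now essentially the least-squares solution
```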
Let’s use a simple neural network to understand common challenges of computing the gradient and a workaround. In standard regression problems, a neural network is a real-valued function \(f:\mathbb{R}^k\to\mathbb{R}\) composed in a special way. The composition is typically illustrated using a network diagram as below.
Each arrow represents a scalar number passed between two nodes. Specifically, the output \(a_j^{(k)}\) of node \(j\) in layer \(L_k\) is multiplied by an associated weight \(w_{lj}^{(k)}\), and the product \(w_{lj}^{(k)}a_j^{(k)}\) is passed to node \(l\) in \(L_{k+1}\), for \(k\in\{1,\dots,K\}\). For example, \(w_{32}^{(1)}\) is the weight applied to the output \(a_2^{(1)}\) from Green#2 before it is passed to Purple#3.
Except for nodes in the input layer, every node represents a function that takes input values from connected nodes in the previous layer and outputs a scalar number. Specifically, an output from node \(l\) in \(L_{k+1}\) is computed as follows: \[\begin{align} z_l^{(k+1)} &= w_{l0}^{(k)} + \sum_{j=1}^{p_{k}}w_{lj}^{(k)}a_j^{(k)}\\ a_l^{(k+1)} &= g^{(k+1)}(z_l^{(k+1)}), \end{align}\] where \(p_k\) is the number of nodes in \(L_k\), \(w_{l0}^{(k)}\) is a bias (intercept) term, and \(g^{(k+1)}\) is the activation function used in \(L_{k+1}\).
The overall function \(f = a_1^{(K)}\) is a composition of these operations.
Even in such a small neural network, there are \((4\times 5 + 5) + (5\times 1 + 1) = 31\) parameters to optimise. You can imagine that, with a high-dimensional input \(x\), several layers, and many nodes in each layer, the number of parameters can be quite large.
Since \(f\) is a composition of explicit activation functions \(g\) and linear transformations, using the chain rule we could in principle write down a partial-derivative expression with respect to every single \(w_{lj}^{(k)}\) and call each derivative function separately when needed. But this would be infeasible for complicated compositions. In the case of neural networks, we should instead exploit the hierarchical structure of \(f\). Backpropagation is the technique that does so, making the gradient computation, and hence the fitting of complicated models, computationally feasible.
When the minimisation problem is \[\min_{w\in\mathbb{R}^d} \mathcal{C}(w) = \min_{w\in\mathbb{R}^d} \sum_{j=1}^m c_j(w),\] a partial derivative of \(\mathcal{C}(w)\) is the sum of partial derivatives of \(c_j(w)\) for all \(j\). In the following derivation, therefore, we focus on a representative data point \((x,y)\) and the associated term \(c(w)\).
In a regression problem, we often have \(c_j(w) := (y_j-f(x_j;w))^2\).
To compute \(c(w)\), we first evaluate \(f\) at \(x\) using some initial \(w\). In the above figure, this computation flows from left to right; it is called forward propagation, as it moves the input \(x\) forward to \(f(x)\). In addition to the final value \(f(x)\), along the way we store the intermediate values \(a_l^{(k)}\) and \(z_l^{(k)}\) for all \(l\) and \(k\).
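A minimal forward-propagation sketch for a small network (the layer sizes, random weights, sigmoid hidden activation, and identity output are illustrative assumptions, not the lecture's figure); note how each \(z\) and \(a\) is stored along the way:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(7)
W1 <- matrix(rnorm(4 * 3), 4, 3)  # layer 1 -> 2: 4 hidden nodes x (bias, 2 inputs)
W2 <- matrix(rnorm(1 * 5), 1, 5)  # layer 2 -> 3: 1 output node x (bias, 4 inputs)
x <- c(0.5, -0.2)
z1 <- drop(W1 %*% c(1, x))    # pre-activations, stored for backpropagation
a1 <- sigmoid(z1)             # activations, stored for backpropagation
f.out <- drop(W2 %*% c(1, a1))  # identity activation at the output node
```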
A partial derivative of \(c(w)\) with respect to a weight \(w_{lj}^{(K-1)}\) for the output layer \(L_{K}\) is straightforward: \[\begin{align} \frac{\partial c}{\partial w_{1j}^{(K-1)}} &= \frac{\partial c}{\partial a_1^{(K)}} \frac{\partial a_1^{(K)}}{\partial z_1^{(K)}} \frac{\partial z_1^{(K)}}{\partial w_{1j}^{(K-1)}}\\ &= \delta_1^{(K)} a_j^{(K-1)}, \end{align}\]
The quantity \(\delta_i^{(k)} := \frac{\partial c}{\partial a_i^{(k)}} \frac{\partial a_i^{(k)}}{\partial z_i^{(k)}} = \frac{\partial c}{\partial z_i^{(k)}}\) for \(i\in\{1,\dots,p_{k}\}\) is often called the error for node \(i\) in \(L_k\).
With the following figure as a visual aid, try to compute a partial derivative of \(c(w)\) with respect to a weight \(w_{lj}^{(k-1)}\) for \(L_k\) \((k\in\{2,\dots,K-1\})\).
It looks complicated, but what is happening is simply the chain rule applied several times. We just need to keep track of indices, which computers are good at and human beings are not…
\[\begin{align} \frac{\partial c}{\partial w_{lj}^{(k-1)}} &= \sum_{i=1}^{p_{k+1}} \frac{\partial c}{\partial z_i^{(k+1)}} \frac{\partial z_i^{(k+1)}}{\partial w_{lj}^{(k-1)}}\\ &= \sum_{i=1}^{p_{k+1}} \delta_i^{(k+1)} \frac{\partial z_i^{(k+1)}}{\partial a_l^{(k)}} \frac{\partial a_l^{(k)}}{\partial z_l^{(k)}} \frac{\partial z_l^{(k)}}{\partial w_{lj}^{(k-1)}}\\ &= \sum_{i=1}^{p_{k+1}} \delta_i^{(k+1)} w_{il}^{(k)} \dot{g}^{(k)}(z_l^{(k)}) a_j^{(k-1)}\\ &= a_j^{(k-1)}\left(\sum_{i=1}^{p_{k+1}} w_{il}^{(k)} \delta_i^{(k+1)}\right) \dot{g}^{(k)}(z_l^{(k)})\\ (&= a_j^{(k-1)} \delta_l^{(k)}) \end{align}\] where \(\dot{g}^{(k)}(z_l^{(k)}) = \frac{\partial a_l^{(k)}}{\partial z_l^{(k)}}\) is the derivative of \(g^{(k)}\) evaluated at \(z_l^{(k)}\) (the other factors are already numbers).
The key factor is \[\sum_{i=1}^{p_{k+1}} w_{il}^{(k)} \delta_i^{(k+1)}.\] To compute a partial derivative of \(c(w)\) with respect to a weight \(w_{lj}^{(k-1)}\) for \(L_k\) \((k\in\{2,\dots,K-1\})\), we need to know how much this weight \(w_{lj}^{(k-1)}\) (or \(w_{lj}^{(k-1)}a_j^{(k-1)}\), as an input to node \(l\) in \(L_{k}\)) contributes to the objective function \(c(w)\).
For \(k=K\), since there is a direct connection between \(w_{1j}^{(K-1)}\) and \(c(w)\), the computation is simple. For \(k<K\), \(w_{lj}^{(k-1)}\) first contributes to the output \(a_l^{(k)}\) of node \(l\) in \(L_k\), which in turn contributes to multiple nodes in \(L_{k+1}\). This is the reason for the above summation: a weighted sum of the errors \(\delta_i^{(k+1)}\) at \(L_{k+1}\).
The technique is called backpropagation because this information passes in the reverse direction: the errors \(\delta_l^{(k)}\) for \(k<K\) are computed sequentially, starting from \(\delta_1^{(K)}\) at the output layer.
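Putting the pieces together for a toy network (sigmoid hidden layer, identity output, squared error \(c=(y-f)^2\); all sizes and weights are illustrative assumptions), and checking one backpropagated partial derivative against a finite difference:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(7)
W1 <- matrix(rnorm(4 * 3), 4, 3)  # hidden layer: 4 nodes x (bias, 2 inputs)
W2 <- matrix(rnorm(1 * 5), 1, 5)  # output layer: 1 node x (bias, 4 inputs)
x <- c(0.5, -0.2); y <- 1
forward <- function(W1, W2, x) {  # forward pass, storing z and a
  z1 <- drop(W1 %*% c(1, x)); a1 <- sigmoid(z1)
  list(z1 = z1, a1 = a1, f = drop(W2 %*% c(1, a1)))
}
fw <- forward(W1, W2, x)
# output-layer error: dc/dz at the (identity) output node, c = (y - f)^2
delta.out <- -2 * (y - fw$f)
# hidden-layer errors: delta_l = (sum_i w_il delta_i) * g'(z_l), sigmoid' = a(1 - a)
delta.hid <- W2[1, -1] * delta.out * fw$a1 * (1 - fw$a1)
# dc/dW1[l, j] = delta_l * a_j of the previous layer (here the input, with bias)
grad.W1 <- outer(delta.hid, c(1, x))
# finite-difference check of one entry
eps <- 1e-6
W1p <- W1; W1p[2, 3] <- W1p[2, 3] + eps
fd <- ((y - forward(W1p, W2, x)$f)^2 - (y - fw$f)^2) / eps
```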
Now, you should see how light the computation of each partial derivative is, despite the complicated overall \(f\).